Abstract

An increasing number of works explore collaborative
human-computer systems in which human gaze is used to
enhance computer vision systems. For object detection, these
efforts have so far been restricted to late integration approaches
that have inherent limitations, such as increased precision
without an increase in recall. We propose an early integration
approach in a deformable part model, which constitutes
a joint formulation over gaze and visual data. We
show that our GazeDPM method improves over the state-of-the-art
DPM baseline by 4% and over a recent method for
gaze-supported object detection by 3% on the public POET
dataset. Our approach additionally provides introspection of
the learnt models, can reveal salient image structures, and
allows us to investigate the interplay between gaze-attracting
and gaze-repelling areas, the importance of view-specific models,
as well as viewers’ personal biases in gaze patterns. Finally, we
study important practical aspects of our approach,
such as the impact of using saliency maps instead of real
fixations, the impact of the number of fixations, and
robustness to gaze estimation error.

1. Introduction 

Across many studies, human gaze patterns have been shown
to reflect cognitive processes, such as intents, tasks, or
cognitive load, and therefore represent a rich source of
information about the observer. Consequently, they have been
successfully used as a feature for predicting the user’s
internal state, such as user context, activities, or visual
attention [1, 2, 3, 15]. Recent advances in eye tracking
technology [26, 13, 32, 28, 33] open up a wide range of new
opportunities to advance human-machine collaboration (e.g.
[22]) or aid computer vision tasks, such as object recognition
and detection [29, 19, 12]. The overarching theme is
to establish collaborative human-machine vision systems in
which part of the processing is carried out by a computer and
another part is performed by a human and conveyed to the
computer via gaze patterns, typically in the form of fixations.

Figure 1: A recent late integration approach for gaze-supported
object detection [29] learns from image and gaze information
separately (bottom). In contrast, our GazeDPM method enables early
integration of gaze and image information (top).


Yun et al. recently applied this approach to object detection
and showed how to improve performance by re-scoring
detections based on gaze information [29]. However, image
features and gaze information were processed independently
and only the outputs of the two pipelines were fused. This
constitutes a form of late integration of both modalities and
comes with inherent limitations. For example, the re-scoring
scheme can improve precision but cannot improve recall.
Also, exploitation of dependencies between modalities is
limited as two separate models have to be learned.

In contrast, we propose an early integration scheme using
a joint formulation over gaze and visual information
(see Figure 1). We extend the deformable part model [7] to
combine deformable layouts of gradient and gaze patterns
into a GazeDPM model. This particular model choice allows
for rich introspection into the learned model and direct
comparison to previous work employing late integration. Our
analysis reveals salient structures, the interplay between gaze
attracting and repelling areas, the importance of view-specific
models, as well as personal biases of viewers. As we have
highlighted the emerging opportunities of applying such
collaborative schemes in applications, we further study and
quantify important practical aspects, such as the benefits of human
gaze data over saliency maps generated from image
data only, temporal effects of such a collaborative scheme,
and noise in the gaze measurements.

The specific contributions of this work are threefold. First,
we present the first method for early integration of gaze
information for object detection based on a deformable part
model formulation. In contrast to previous late integration
approaches where gaze information is only used to re-score
detections, we propose a joint formulation over gaze and
visual information. Second, we compare our method with
a recent late integration approach [29] on the publicly available
POET dataset [19]. We show that our early integration
approach outperforms the late integration approach in terms
of mAP by 3% and provides deeper insight into the model
and properties of the data. Third, we present and discuss
additional experiments exploring important practical aspects
of such collaborative human-computer systems using gaze
information for improved object detection.

2. Related Work 

Our method is related to previous works on 1) deformable
part models for object detection, 2) visual saliency map
estimation, and 3) the use of gaze information in collaborative
human-computer vision systems.

Deformable Part Models One of the most successful approaches
for object detection over the last decade is the
deformable part model [7]. Deformable part models consist
of a set of linear filters that are used to detect a coarse
representation of an object and refine detections using filters
that respond to specific details of the objects being detected.
Because of their simplicity and ability to capture complex
object representations, many extensions of deformable part
models have been proposed in the literature [6], where the use
of more complex pipelines or better features improves
detection performance. Recently, improved detection
performance was demonstrated using neural network based
approaches [20, 8]. In this work we opted to build on deformable
part models because they allow for better introspection
and have the potential to better guide future work
on the exploration of gaze information in computer vision.
Introspection in deep architectures is arguably more difficult
and a topic of ongoing research [24, 30].

Visual Saliency Map Estimation Saliency estimation
and salient object detection algorithms can be used for scene
analysis and have many practical applications [10]. For example,
they allow estimating the probability of an observer fixating
on some area in an image and thus allow modelling which parts
of the depicted scene attract the attention of a human observer. A
variety of saliency approaches have been developed, such as
graph-based visual saliency [9], the boolean map approach [31],
and recent approaches using neural networks [27], which
estimate saliency maps or detect the most salient objects for a
given scene or video segment.

To evaluate saliency algorithms, many datasets containing
images and eye tracking information from a number of
observers are available. For example, eye tracking data is
available in [11, 29] for a free viewing task, in the large POET
dataset [19] for a visual search task, and in [14] for the evaluation
of saliency algorithms on video sequences. Gaze data can be
acquired via a wide range of methods [26, 13, 32, 28, 33];
this acquisition is not part of our investigation, although we do
evaluate robustness with respect to
noise in the gaze data. We evaluate our work on the existing
POET dataset and investigate to what extent saliency maps can
substitute for real gaze data in our approach.

Collaborative Human-Computer Vision Systems 

There has recently been an increasing interest in using gaze
information to aid computer vision tasks. For example,
fixation information was used for weakly supervised
training of object detectors [19, 12], analysing pose estimation
tasks [16], inferring scene semantics [25], detecting
actions [17], and predicting search tasks [22]. Our approach
more specifically targets the use of gaze data for object
detection. The most closely related work to ours is [29],
where gaze information was used to re-score detections
produced by a deformable part model. In contrast, the
proposed GazeDPM approach integrates gaze information
directly into deformable part models and therefore provides
a joint formulation over visual and gaze information. We
further consider saliency maps as a substitute for real gaze
data. Similar ideas can be found in [21, 18], where it was
demonstrated that saliency maps can be used to improve
object detection performance.

3. Gaze-Enabled Deformable Part Models 

To formulate a joint model of visual and gaze information
we build on the established deformable part models (DPM)
[7]. In contrast to recent developments in deep learning,
this particular model allows for better model introspection.
Deformable part models predict bounding boxes for object
categories in an image (e.g. a bicycle) based on visual features.
In this section we describe the necessary background and
our extension of deformable part models towards our new
GazeDPM formulation, which is used in further sections for
gaze-enabled object detection.

3.1. Visual Feature Representation 

DPMs use feature maps for object detection in images.
Feature maps are arrays of feature vectors in which every
feature vector contains local information that corresponds to
some patch in an image (e.g. the average direction or magnitude
of the derivative). To enable DPMs to detect objects at
different scales, a feature pyramid is used, which consists of
feature maps computed from an image at different scales.
Throughout this work we use, in addition to gaze features, the
31-dimensional feature representation described in [7], which is
obtained by analytical reduction of HOG features [5].

Figure 2: Comparison of the vision-only deformable part model (DPM) on the left and the gaze-enabled deformable part model (GazeDPM) on the
right.

Figure 3: Example fixation density map overlaid on the corresponding
image from the POET dataset. White dots represent
fixations by different observers; color close to red indicates high
fixation density, and color close to blue low density. Note that the Gaussians
around each fixation are weighted by fixation duration.

3.2. Fixation Density Maps 

Gaze information is available as two-dimensional coordinates
of observers’ fixations on an image as well as their
durations. We encode sequences of these fixations into fixation
density maps, which have proved useful for many tasks. This
representation was used in prior work [29] and we follow the
same approach here. For every image, a fixation density map
is obtained by pixelwise summation of weighted
Gaussian functions placed at every fixation position in the
image, followed by normalization of the resulting map to the range
from 0 to 1. Every Gaussian function in the sum corresponds
to a normal distribution with mean equal to the fixation
coordinates and a diagonal covariance matrix with
values of $\sigma^2$ on the diagonal, where $\sigma$ is selected to be 7% of
the image height. The weight of each Gaussian function is the
corresponding fixation duration. The fixation map is normalized
by dividing the sum of weighted Gaussian functions by its
maximum value. This
representation is equally applicable if real gaze fixations
are not available and a saliency map algorithm is used as a
substitute [10, 9, 31]. These methods also produce a density
map that tries to mimic the fixation density map produced
by real fixation data. A sample fixation density map
obtained from real fixations is shown in Figure 3, where it is
overlaid onto the corresponding image.
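The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in our experiments: the function name `fixation_density_map`, its parameters, and the `(x, y, duration)` tuple layout are assumptions made for the example, and the Gaussian is taken to be isotropic with a standard deviation of 7% of the image height.

```python
import numpy as np

def fixation_density_map(fixations, height, width, sigma_frac=0.07):
    """Build a duration-weighted fixation density map normalized to [0, 1].

    fixations: iterable of (x, y, duration) tuples in pixel coordinates
               (hypothetical layout chosen for this sketch).
    sigma_frac: Gaussian standard deviation as a fraction of image height.
    """
    sigma = sigma_frac * height
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for x, y, duration in fixations:
        sq_dist = (xs - x) ** 2 + (ys - y) ** 2
        # Unnormalized isotropic Gaussian; the shared constant factor
        # cancels in the final division by the maximum.
        density += duration * np.exp(-sq_dist / (2.0 * sigma ** 2))
    peak = density.max()
    # Divide by the maximum so that the map lies in [0, 1].
    return density / peak if peak > 0 else density
```

A saliency map produced by an algorithm could be passed through the same final normalization step and used in place of the returned array.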

3.3. Deformable Part Models 

Deformable part models are star models defined by a root
filter that is used to detect the coarse, holistic representation
of the whole object, and part filters that are used to detect
individual parts of an object. Root filter detections are used
to determine an anchor position, and the scores of the root and
part filters together with deformation coefficients are used
to compute a detection score with latent part placement.
The score maps of the root and part filters are computed by
convolving the filters with the feature maps in the feature pyramid.

Figure 4: Example detections on images of four different classes from the POET dataset [19]. For every triple of images, left is the image
with detections of the original DPM, center is the image with GazeDPM detections, and right is the density map generated from the fixations.
True positive detections are shown in green, false positive detections in red.

Each part filter is anchored at some position relative to
the root filter. Let $P$ denote the set of all possible locations in
an image. For a given deformable part model and some location
$p_0 \in P$ in an image, the overall detection score at this
location is given by

$$s(p_0) = r_0(p_0) + \max_{p_1, \dots, p_n} \sum_{i \in [n]} \big( r_i(p_i) - d_i(p_0, p_i) \big), \tag{1}$$

where $p_i \in P$, $i \in [n]$, $n \in \mathbb{N}$, $s(p_0)$ is the score of
the DPM positioned in the image at position $p_0$, $r_0(p_0)$ is the response of
the root filter at $p_0$, $r_i(p_i)$ is the response of the filter that corresponds
to the part located at position $p_i$, and $d_i(p_0, p_i)$ is a
displacement penalty.
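To make the role of the latent part placement concrete, the following sketch evaluates a star-model score of this form by brute force. It is only an illustration under simplifying assumptions: all response maps share one resolution, the displacement penalty is a single quadratic term with a hypothetical scalar weight `def_weight`, and the function name `dpm_score` is ours. The implementation of [7] uses part filters at a finer resolution, learned deformation coefficients, and distance transforms instead of exhaustive search.

```python
import numpy as np

def dpm_score(root_resp, part_resps, anchors, def_weight, p0):
    """Score of a star model at root location p0 = (row, col).

    root_resp:  2-D array of root filter responses r_0.
    part_resps: list of 2-D arrays of part filter responses r_i.
    anchors:    list of (d_row, d_col) anchor offsets relative to the root.
    def_weight: quadratic displacement cost (hypothetical scalar).
    """
    score = root_resp[p0]
    for resp, (ar, ac) in zip(part_resps, anchors):
        rows, cols = np.indices(resp.shape)
        # Quadratic penalty for moving away from the anchor position.
        penalty = def_weight * ((rows - (p0[0] + ar)) ** 2
                                + (cols - (p0[1] + ac)) ** 2)
        # Parts are placed independently given the root, so the
        # maximization can be carried out separately for each part.
        score += np.max(resp - penalty)
    return score
```

Thresholding this score at every root location and scale gives the sliding window detector described next.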

A sliding window approach is used to detect objects with
this model. For every position in the image and at every scale,
an optimal placement of parts is determined by maximizing
the score function. If the resulting score is above a threshold, the
hypothesis that an object is present in the bounding box is accepted.
As part placements are latent, a latent SVM formulation
is used for training [7]. DPMs can then be expressed as a
classifier of the following form:


$$f_\beta(x) = \max_{z \in Z(x)} \langle \beta, \Phi(x, z) \rangle, \tag{2}$$

where $x$ is an image feature map, $Z(x)$ is the set of all possible
sliding windows in an image, $\beta$ is a vector that contains
the weights of all linear filters and the displacement costs,
and $\Phi(x, z)$ is the feature subset that corresponds to some
sliding window $z \in Z(x)$. Training a DPM then corresponds
to finding parameters $\beta$ of the linear filters in the DPM such that
they minimize the objective

$$L(\beta) = \|\beta\|_2^2 + C \sum_{i \in [m]} \max\big(0,\, 1 - y_i f_\beta(x_i)\big), \tag{3}$$

where $y_i$ is a label that indicates if an instance of a class
is present in image $x_i$, $C$ weights the loss term, and there are
$m \in \mathbb{N}$ such images
available. For more information regarding the optimization of
the above function and other details, please refer to [7].
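As a small worked illustration, the classifier and training objective can be written down directly once the latent placements of each example are enumerated. The sketch below assumes the standard latent SVM form from [7], i.e. a squared L2 regularizer plus a hinge loss weighted by a constant `C`; the exact scaling of the regularizer, the function names, and the array layout are assumptions of this example, and no optimization procedure is shown.

```python
import numpy as np

def classifier_score(beta, candidates):
    """Best linear score over the latent placements of one example.

    candidates: array of shape (num_placements, dim) holding one feature
                vector per latent placement z (hypothetical layout).
    """
    return np.max(candidates @ beta)

def objective(beta, examples, labels, C=1.0):
    """Squared L2 regularizer plus C times the summed hinge loss."""
    hinge = sum(max(0.0, 1.0 - y * classifier_score(beta, phi))
                for phi, y in zip(examples, labels))
    return float(beta @ beta + C * hinge)
```

Because the score is a maximum over linear functions, the objective is convex in `beta` for negative examples and becomes convex for positives once their latent placement is fixed, which is what the training procedure of [7] exploits.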

3.4. Integration of Gaze Information 

We extend the original DPM formulation by adding additional
parts that are trained on a new fixation density map
feature channel:

$$f'_\beta(x) = \max_{z \in Z(x)} \big( \langle \beta, \Phi(x, z) \rangle + \langle \beta', \Phi'(x, z) \rangle \big), \tag{4}$$

where $\beta'$ corresponds to the parameters of linear filters that are
applied to fixation features $\Phi'(x, z)$. We call the resulting
extended DPM “GazeDPM”. We refer to this method of
integration of gaze information as “early integration” given
that fixation data is directly used in the DPM. This is in
contrast to a recent work [29] that used a “late integration
approach” by using fixation information not directly in the
DPM but to refine its detections. Figure 2 provides visualisations
comparing these two different integration methods.
The overall detection score for a GazeDPM model applied at
some location $p_0$ in an image can be computed as

$$s'(p_0) = R_0(p_0) + \max_{p_1, \dots, p_n} \sum_{i \in [n]} \big( R_i(p_i) - d_i(p_0, p_i) \big), \tag{5}$$

where

$$R_i(p_i) = r_i(p_i) + r'_i(p_i), \tag{6}$$

$$i \in \{0, \dots, n\}, \tag{7}$$

denotes the joint response of the linear filter $r'_i(p_i)$ applied to gaze
features and the linear filter $r_i(p_i)$ applied to image
features at position $p_i \in P$; $i = 0$ denotes the root filter and
$i \in \{1, \dots, n\}$ denotes a part filter.
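The joint response amounts to adding two filter responses computed on different feature channels. The sketch below illustrates this with a naive dense cross-correlation; the helper names `correlate_valid` and `joint_response`, the single-channel gaze map, and the requirement that both filters share the same spatial size are assumptions of the example, not a description of the actual implementation.

```python
import numpy as np

def correlate_valid(feature_map, filt):
    """Dense 'valid' cross-correlation of an (H, W, D) map with an (h, w, D) filter."""
    H, W, _ = feature_map.shape
    h, w, _ = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(feature_map[r:r + h, c:c + w] * filt)
    return out

def joint_response(hog_map, gaze_map, hog_filter, gaze_filter):
    """Sum of the image-feature response and the gaze-feature response."""
    return (correlate_valid(hog_map, hog_filter)
            + correlate_valid(gaze_map, gaze_filter))
```

Since both terms are linear, the same result is obtained by concatenating the gaze channel to the image features and applying a single filter with one additional channel, which is one way to see why the gaze weights are learned jointly with the gradient weights.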


3.5. Implementation 

We implemented our GazeDPM model based on the MATLAB
implementation of the original deformable part models
provided with [7]. In order to ensure reproducibility and
stimulate research in this area we will make code and models
publicly available at the time of publication. In the following,
we provide an experimental evaluation of our GazeDPM model
in different settings, compare to prior work, and provide
additional insights and analysis into the learnt models.

4. Experiments on POET 

We first compare our GazeDPM method to the late integration
approach proposed in [29]. All of the experiments
presented in this section were performed on the Pascal Objects
Eye Tracking (POET) dataset [19]. This dataset contains
eye tracking data for 10 classes of the original Pascal
VOC 2012 dataset. Eye tracking data was collected from observers
whose task was to find one of the Pascal classes present
in the image (visual search task). We split the dataset into
training and testing sets with approximately equal numbers of
class instances (approx. 3000 testing and training images)
and use this split throughout all of the experiments below, unless
specified otherwise.

For evaluating the performance of the models we use the
evaluation code provided with the VOC dataset [6]. For
all experiments we used two aspect ratio clusters for DPM
detectors and default thresholds. We found experimentally
that two clusters yielded the best performance for stock
DPMs and our modification on the POET dataset and therefore
used these settings throughout the experiments. For a more
detailed account of the experimental results described in this
section, please refer to the supplementary material.

4.1. Early vs Late Integration 

We re-implemented the late integration method proposed 
in [29] with assistance of the authors. Evaluation results 
are provided in Table 1. For comparison, we also show the 
unmodified DPM performance. As can be seen from the 
table, our reimplementation of the late integration scheme 
achieves an improvement of 0.4% which is consistent with 
the results published in [29]. Our GazeDPM achieves an 
overall performance of 34.7%, which is a 3.9% improvement 
over the late integration [29] and 4.3% improvement over 
the DPM baseline. These results provide evidence for the 
benefits of an early integration scheme and joint modelling 
of visual and fixation features. 
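The averages and improvements quoted above follow directly from the per-class results in Table 1. The short script below recomputes them; the per-class numbers are copied from the table, while the variable names are chosen for this example only.

```python
# Per-class average precision (%) copied from Table 1, in table row order:
# cat, cow, dog, horse, aeroplane, bicycle, boat, diningtable, motorbike, sofa.
ap = {
    "DPM":     [23.9, 22.6, 14.7, 43.9, 41.8, 53.5, 8.4, 19.8, 48.5, 26.7],
    "late":    [24.0, 22.6, 15.2, 44.0, 42.3, 53.8, 8.4, 21.8, 48.7, 27.4],
    "GazeDPM": [40.2, 24.9, 28.2, 46.0, 40.6, 53.5, 9.3, 30.0, 45.9, 28.5],
}

# Mean AP per method, rounded to one decimal as in the table.
mean_ap = {name: round(sum(v) / len(v), 1) for name, v in ap.items()}

# Improvements of GazeDPM in percentage points of mAP.
gain_over_dpm = round(mean_ap["GazeDPM"] - mean_ap["DPM"], 1)
gain_over_late = round(mean_ap["GazeDPM"] - mean_ap["late"], 1)

print(mean_ap, gain_over_dpm, gain_over_late)
```

The differences are absolute differences in mAP (percentage points), not relative improvements.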

Note that for the late integration of gaze information
[29] one needs a baseline DPM model, which should
be trained on images other than those used as training
and testing sets for the gaze classifier procedure. In the experiment
described in this section, however, we used the training
set for both the gaze classifier procedure and DPM training. To
account for this, we established a more rigorous comparison
setup that is more favorable for late integration, which for
brevity we do not describe here (see supplement), but for
which we still get an improvement with GazeDPM of 3%
mAP compared to late integration.

Figure 5: Example salient structures in images (a, b). Notice that
people tend not to fixate on the animal's neck. Compare to the learned
weights of gaze filters in Table 2.

4.2. Further Analysis of GazeDPM 

To gain deeper insights, we further analyzed and visual¬ 
ized the GazeDPM models that we trained. 

4.2.1 Analysis of Example Detections 

Example detections and error cases are shown in Figure 4.
Given that fixation information is quite informative regarding
the class instance, it allows our method to obtain more
true positive detections (see Figure 4a) and remove some
false positives (see Figure 4c). We observed that there was a
tendency towards bounding boxes covering most of the fixations,
as can be seen in Figure 4b. This indicates that the model
has learnt that fixations are a strong indicator for object presence,
which is exploited by our GazeDPM method. This
assumption can be violated, as observers also produce spurious
fixations in search tasks. In some cases such fixations
related to exploration of the image can create false positives,
as can be seen in Figure 4d.

4.2.2 Learning Salient Structures 

When looking at some of the part visualizations and
corresponding fixation density maps, we realized that
GazeDPM models are able to learn salient structures for
different categories and aspect ratio clusters. For example,
for the cat category, Figure 5 shows some example images
with positions of observer fixations. It is well known that
people tend to fixate on heads of animals [29] and this can
be seen by visualizing the distribution of fixations for the cat
class (see Table 2). We also found that people usually tend
not to fixate on the animal's neck (see Figure 5). This is reflected in
the resulting GazeDPM models by a strong positive weight
(see Table 2), which acts like a “gaze attractor” at the location
where the animal head is located, and by a strong negative
weight, which acts like a “gaze repellent” in the area where
the animal neck is located. We would like to draw attention to the
root filter of the one-component model in the gaze density map,
which has a negative weight close to the neck of the animal. In this way
our GazeDPM model tries to exploit such salient structures
present in the training data. Similar effects can be seen in the gaze
part filters; however, as part filters can be shifted, they
appear to be located so as to account for different
locations of peaks (animal head) in the fixation map specific
to an image. Looking across the learnt filters, we see an
interesting interaction between areas on the object that attract
fixations and close-by regions that are not fixated. The latter
areas can be seen as “gaze repellents”, which might be due to a
shadowing effect of the close-by attractor.

             DPM       [29]    GazeDPM  Amount of noise                 Participant-specific fixations
Class        original  late    early    0.5σ2   σ2      1.5σ2   2σ2     P1     P2     P3     P4     P5
cat          23.9      24.0    40.2     39.0    33.1    33.0    29.1    36.3   34.4   35.4   36.1   33.4
cow          22.6      22.6    24.9     20.3    21.1    18.6    21.9    21.3   22.7   21.2   19.1   20.1
dog          14.7      15.2    28.2     23.5    23.2    15.5    15.8    24.6   18.5   22.8   22.5   23.4
horse        43.9      44.0    46.0     44.5    42.6    40.7    40.7    43.0   46.5   41.8   43.9   43.8
aeroplane    41.8      42.3    40.6     42.4    42.2    44.2    44.8    40.4   38.9   42.4   39.9   43.8
bicycle      53.5      53.8    53.5     53.9    51.9    52.6    52.8    53.4   52.6   52.6   53.0   52.5
boat         8.4       8.4     9.3      10.1    8.7     7.2     8.8     10.0   10.3   7.1    9.3    7.8
diningtable  19.8      21.8    30.0     30.8    15.0    13.0    23.1    24.3   26.2   18.5   27.9   26.1
motorbike    48.5      48.7    45.9     46.1    46.4    47.1    46.4    44.3   43.4   44.2   44.4   46.1
sofa         26.7      27.4    28.5     32.8    31.1    25.8    24.3    29.1   27.1   29.2   29.4   23.7
Average      30.4      30.8    34.7     34.3    31.5    29.8    30.8    32.7   32.1   31.5   32.5   32.1

Table 1: Performance comparison of all three methods (original DPM [7], late integration [29], and our GazeDPM) on the POET dataset.
For all modifications two aspect ratio clusters were used. P1 ... P5 means only fixations from that specific participant were used to generate
fixation density maps. Columns with multiples of σ2 denote performance with fixation density maps generated with different amounts of
noise to simulate the influence of low-accuracy gaze estimation settings.

4.2.3 Learning View-Specific Information 

It is also well known that different DPM components correspond
roughly to different viewpoints on an object. To
analyse this for our model, we computed fixation density
maps conditioned on category and associated component
of the corresponding DPM models and compared them to
fixation density maps only conditioned on the category. The
sample distributions for the “cat” class in Table 2 show that
the component-conditioned fixation density maps and the
learnt models differ for the two-component model, as the
component-conditioned densities contain view-specific information.
Specifically, the mode of the fixation density map
is located in the upper half, where the heads of animals are
usually located. For different viewpoints the mode location
and thus the fixation distributions change. Due to the early
integration, our GazeDPM model can exploit this information
and we attribute part of its success to the viewpoint-specific
fixation density modelling. This can also be seen by comparing
the gaze parts of the two-component model with the gaze
parts of the one-component model. Part gaze filters for one
component have weights distributed more evenly in order
to account for the different distributions of different views of the
object. On the coarser scale of the root filters the opposite
dependency holds, as the fixation distribution is more stable at a
coarser scale across different views and thus is more useful
for the single-cluster model.

4.3. Performance on Fixation Subsets 

Gaze information available in the POET dataset is collected
from five observers. In many practical applications
only a smaller amount of fixation information might be available,
such as only from one user or collected for a shorter
amount of time. To study the performance of GazeDPM in
these conditions, we ran a series of studies on subsets of
fixations available for each image in the POET dataset.

4.3.1 Influence of Number of Fixations 

We first sampled a random subset of fixations from all available
fixations for an image and used these to generate fixation
density maps. GazeDPM was trained and evaluated on these
fixation maps and the results are shown in Figure 6. We
found that using only 11 fixations, which is less than half
the available amount in POET, only leads to a reduction of
1% in mAP compared to using all available fixations, which
is still a 3% improvement compared to the DPM baseline.
Notably, even three fixations are already enough
to yield more than 1% improvement compared to the baseline
DPM. For a more complete account of the experiments,
please refer to the supplementary material.

[Table 2 columns: # of components, Fixation density map, Gradient root, Gradient parts, Gaze root, Gaze parts, Deformations]

Table 2: Comparison of marginal and component-conditioned fixation density maps and corresponding GazeDPM models (columns 3 to 7).
In the visualizations of gaze filters, colors close to blue represent negative values and colors close to red represent positive values.

4.3.2 Influence of Order of Fixations 

To investigate the importance of the order of fixations, we
further sub-sampled fixations while keeping their temporal
order. We then trained and evaluated the performance of
GazeDPM with the first n ∈ {1, 3, 7, 11, 15, 19, 23} fixations
and with the last n ∈ {1, 3, 7, 11, 15, 19, 23} fixations. As
shown in Figure 6, the last 7 fixations are more informative
than the first 7 fixations. It turns out that the last fixations
are more likely to be on the target object because observers
perform a visual search task. In particular, using the last 7 fixations,
which is a third of all available fixations, already results in
more than 2% improvement compared to the baseline DPM.
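The sampling criteria compared in Figure 6 can be expressed as simple selections over a temporally ordered list of fixations. The helpers below are a sketch under the assumption that the fixations of an image are already sorted by time; the function names and the fixed random seed are illustrative choices, not part of our pipeline.

```python
import random

def first_n(fixations, n):
    """First n fixations of a temporally ordered sequence."""
    return fixations[:n]

def last_n(fixations, n):
    """Last n fixations of a temporally ordered sequence."""
    return fixations[-n:] if n > 0 else []

def random_n(fixations, n, seed=0):
    """Random subset of at most n fixations, returned in temporal order."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(fixations)), min(n, len(fixations))))
    return [fixations[i] for i in keep]
```

Each selected subset would then be converted into a fixation density map before training and evaluating the model on it.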

4.3.3 Influence of User-Specific Fixations

In many practical use cases, only fixations for a small number
of users (most often just a single user) are available.
Consequently, we trained GazeDPM models on fixation maps
generated using fixations of a specific user. Performance
results of these models are shown in Table 1. On average,
we obtained an improvement of 2% mAP in the single-user
setting over the baseline DPM. Note that a similar improvement
is obtained with fixation maps generated from 5 randomly
sampled fixations (see Figure 6), which is the average number
of fixations in the POET dataset for a single observer per
image. This suggests that fixations from different observers
are roughly equally informative. Although the average
performance of GazeDPM is about equal for different users,
for specific classes performance can be quite different (e.g.
“aeroplane” category performance for user 5). This suggests
biases in individual fixation patterns or search strategies for
specific users. Additional experiments revealed that training
on as few as two users can be enough for performance to be
only 1% below training on all users.

4.4. Robustness to Gaze Estimation Error 

All results that we showed so far were obtained using
fixation data that is subject to some small amount of noise.
The noise is caused by the inherent and inevitable gaze
estimation error in the eye tracker used to record the data.
Data collection in [29] was performed using a high-accuracy
remote eye tracker. However, for many practical applications
other sensors become increasingly interesting, such
as ordinary monocular RGB cameras that are readily integrated
into many mobile phones, laptops, and interactive
displays [34, 32, 28]. In these settings, fixation information
can be expected to contain substantially more noise due to
the even lower gaze estimation accuracy of these methods. We
therefore analyzed how robust our GazeDPM model is to
simulated noise in the fixation data.

Figure 6: Performance of GazeDPM in terms of mAP using images of the POET dataset and fixation density maps, either estimated using
graph-based visual saliency (GBVS) [9] or boolean map saliency (BMS) v2 [31], or generated from fixations sampled by different criteria
from all available fixations for an image.

The method proposed in [23] achieved a gaze estimation
accuracy of roughly 3 ± 3 degrees of visual angle. We
used this accuracy as a starting point for our investigation
of robustness to gaze estimation error. Note, though, that
this accuracy highly depends on the hardware and scenario.
In [23], observers were seated 75 cm away from a screen with
a size of 28 cm by 18 cm. We assume that POET images
were scaled proportionally to fit on the screen. Accordingly,
we first translated fixation coordinates for all images
into centimeters on the screen. We then translated the accuracy
in degrees to accuracy in centimeters of screen surface,
which is 75 tan(3) ± 75 tan(3). We used Gaussian noise to
approximately model the noise due to gaze estimation error.
Specifically, as distribution parameters of the Gaussian
noise we selected μ = 0 and, by the three-sigma rule, we set
σ2 = 75 tan(3 + 3) / 3, so that most of the resulting Gaussian noise
would result in at most 6 degrees of visual angle error. For
comparison, we also considered fixations with added Gaussian
noise for different multiples of σ2. We generated noise
under these assumptions and added it to the fixation coordinates,
expressed in the screen coordinate system. Then we computed
fixation density maps from these noisy fixations and trained
GazeDPM on them.
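The noise model can be illustrated with a short sketch. It assumes a viewing distance of 75 cm, converts a visual angle to an on-screen offset with the tangent, and applies the three-sigma rule to an error of 3 + 3 = 6 degrees. The names `angle_to_cm`, `SIGMA_CM`, and `add_gaze_noise` are hypothetical, and treating the noise multiples as factors on the standard deviation is an assumption of this example.

```python
import math
import numpy as np

VIEW_DIST_CM = 75.0  # viewing distance reported for [23]

def angle_to_cm(degrees, distance_cm=VIEW_DIST_CM):
    """On-screen offset in cm that corresponds to a given visual angle."""
    return distance_cm * math.tan(math.radians(degrees))

# Three-sigma rule: three standard deviations cover a 6 degree error.
SIGMA_CM = angle_to_cm(6.0) / 3.0

def add_gaze_noise(fixations_cm, scale=1.0, seed=0):
    """Add zero-mean Gaussian noise to (N, 2) screen coordinates in cm.

    scale multiplies the standard deviation (assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    pts = np.asarray(fixations_cm, dtype=float)
    return pts + rng.normal(0.0, scale * SIGMA_CM, size=pts.shape)
```

Under these assumptions the standard deviation is roughly 2.6 cm on a screen that is 18 cm high; the noisy coordinates would be mapped back to image pixels before computing fixation density maps.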

Results of this evaluation are shown in Table 1. Even
with a noise level of σ2 we still obtain an improvement of
around 1% compared to the baseline DPM. Such a σ2 roughly
corresponds to an average shift of fixation coordinates of
approximately ± 20% of the image height in pixels. With
larger values of σ2 the improvement vanishes. For a smaller
noise level of 0.5σ2, we obtain performance within 0.4% of
that obtained with the measured fixations without noise. This shows that our
GazeDPM method is robust to small noise levels and yields
improvement up to medium noise levels.

4.5. Experiments Using Saliency Maps 

Although the core of our investigation is centered around
the use of real fixation data, we finally investigated if our
method can also be used in the absence of such data, i.e. if the
fixation density maps are replaced with saliency maps calculated
using state-of-the-art saliency methods. Specifically, we
used graph-based visual saliency (GBVS) [9] and boolean
map saliency (BMS) [31], which both perform very well by
different metrics on the MIT saliency benchmark [4].

As can be seen from Figure 6, our GazeDPM model
achieved an improvement of 0.8% mAP for GBVS and 1%
mAP using BMS compared to the baseline DPM. We hypothesize
that the improvements stem from global features in
the saliency map that the local HOG descriptor in the DPM
does not have access to. We also observed that the obtained
improvement is roughly consistent with the improvement obtained
with one real fixation. Although both saliency maps performed
comparably in this setting, for some object categories
like “cat” there was a significant performance difference of
up to 10% (see supplementary material).

5. Conclusion 

In this work we have presented an early integration
method that improves visual object class detection using
human fixation information. At the core of our approach
we have proposed the GazeDPM as an extension to the
well-known deformable part model that constitutes a joint
formulation over visual and fixation information. We have
obtained an improvement of 4.3% mAP compared to a
baseline DPM and around 3.9% compared to a recent late
integration approach. Further, we have studied a range of
cases of practical relevance that are characterized by limited
or noisy eye fixation data and observed that our approach
is robust to many such variations, which argues for its
practicability. Besides the quantitative results, we have found
that the introspection gained by visualizing the trained models
has led to interesting insights and opens an avenue to
further study and understand the interplay between fixation 
strategies and object cognition. 

6. Supplementary Material 

6.1. Multiple fixation maps 

In this section, we describe experiments in which, instead
of a single fixation map generated from all fixations, multiple
fixation maps are used, each generated from fixations separated
by a certain criterion, such as fixations from a certain viewing time.

We normalize viewing time for every user separately
to account for possible differences in user reaction time.
Specifically, for each user, we compute the average viewing
time over all images (as viewing time we use the end time
of the last fixation) and normalize all fixation times of that
user by this value. To avoid outliers, we do not use the viewing
times of the 3 images in each class that were viewed first.
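
A minimal sketch of this per-user normalization is given below. The record layout (`user`, `order`, `fixations`) and the function name are hypothetical, and the sketch assumes that "normalize" means dividing each fixation time by the user's average viewing time.

```python
from collections import defaultdict

def normalize_fixation_times(records, skip_first=3):
    """Normalize fixation times by each user's average viewing time.

    records: list of dicts with the hypothetical keys
      'user'      - participant id
      'order'     - zero-based position of the image among the images of
                    its class, in the order this user viewed them
      'fixations' - list of (start, end) times for one image
    """
    viewing_times = defaultdict(list)
    for rec in records:
        # The first images of each class are excluded from the average.
        if rec["order"] >= skip_first and rec["fixations"]:
            # Viewing time = end time of the last fixation on the image.
            last_end = max(end for _, end in rec["fixations"])
            viewing_times[rec["user"]].append(last_end)

    normalized = []
    for rec in records:
        times = viewing_times.get(rec["user"])
        if not times:
            continue  # no reference viewing time available for this user
        mean_time = sum(times) / len(times)
        scaled = [(start / mean_time, end / mean_time)
                  for start, end in rec["fixations"]]
        normalized.append({**rec, "fixations": scaled})
    return normalized
```

Note that the excluded images still have their fixation times normalized; they are only left out when the average is computed.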

6.1.1 Fixation length and viewing time as separation criterion 

Soft binning was used to separate fixations into different
saliency map channels. Fixation length is normalized by
viewing time to account for differences in reaction time,
and results are shown in Table 10. K-means was used to
determine clusters based on fixation duration and the viewing
time at which the fixation was made.


Class         DPM    2 cl.  3 cl.  4 cl.  GazeDPM
cat           23.9   37.4   36.7   36.3   40.2
cow           22.6   22.0   21.2   19.9   24.9
dog           14.7   23.6   26.2   23.5   28.2
horse         43.9   41.9   44.5   41.4   46.0
aeroplane     41.8   36.9   40.4   32.4   40.6
bicycle       53.5   52.8   53.1   52.3   53.5
boat           8.4   11.1   10.2    8.6    9.3
diningtable   19.8   17.5   17.5   12.8   30.0
motorbike     48.5   41.9   41.0   40.6   45.9
sofa          26.7   26.4   29.4   23.1   28.5
Average       30.4   31.1   32.0   29.1   34.7

Table 10: Performance of the gaze-enabled DPM modification in terms
of mAP on images of the POET dataset using different numbers of
fixation saliency map features, compared to baseline performance (no
gaze information, DPM column) and to the case when a single fixation
map feature is used (GazeDPM column). K-means was used to determine
clusters based on fixation duration and the viewing time at which the
fixation was made. Fixation duration for each participant of the POET
data collection is normalized by average viewing time.


6.1.2 Fixation length as separation criterion 

Soft binning was used to separate fixations into different
saliency map channels. Fixation length is normalized by
viewing time to account for differences in reaction time,
and results are shown in Table 11. K-means was used to
determine clusters based on fixation duration.
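
As an illustration of the clustering step, the following is a small one-dimensional k-means (Lloyd's algorithm) over normalized fixation durations. It is a sketch only: the function name, the quantile-based initialization and the example durations are assumptions, and the k-means implementation and initialization actually used are not specified in the text.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster scalar values into k groups and return the centroids.

    Requires 1 <= k <= len(values).
    """
    data = sorted(values)
    # Deterministic initialization at evenly spaced quantiles (an assumption).
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            # Assign each value to its nearest centroid.
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Recompute centroids; keep the old one if a cluster is empty.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

# Hypothetical normalized fixation durations forming two groups.
durations = [0.10, 0.12, 0.11, 0.50, 0.52, 0.51]
centroids = kmeans_1d(durations, k=2)
```

The resulting centroids serve as bin centers, one per fixation map channel.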


Class         DPM    2 cl.  3 cl.  GazeDPM
cat           23.9   37.4   36.7   40.2
cow           22.6   22.6   21.6   24.9
dog           14.7   26.1   24.5   28.2
horse         43.9   42.5   42.8   46.0
aeroplane     41.8   43.4   37.6   40.6
bicycle       53.5   54.5   53.2   53.5
boat           8.4    9.2    8.3    9.3
diningtable   19.8   25.2   22.1   30.0
motorbike     48.5   46.1   42.4   45.9
sofa          26.7   29.3   26.9   28.5
Average       30.4   33.6   31.6   34.7

Table 11: Performance of the gaze-enabled DPM modification in terms
of mAP on images of the POET dataset using different numbers of
fixation saliency map features, compared to baseline performance (no
gaze information, DPM column) and to the case when a single fixation
map feature is used (GazeDPM column). K-means was used to determine
clusters based on fixation duration. Fixation duration for each
participant of the POET data collection is normalized by average
viewing time.


As the number of fixations per image is small, soft binning
based on similarity was used. Centroids from the k-means
algorithm are used as centers of bins for a certain gaze feature.
To determine the contribution of a certain fixation to a bin with
centroid c ∈ C, where C is a set of n ∈ ℕ centroids,
the following formula is used:

\[ a(d, c) = \frac{s(d, c)}{\sum_{c' \in C} s(d, c')} \tag{8} \]

\[ s(d, c) = \mathcal{N}(d, \sigma)(c) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(c - d)^2}{2\sigma^2}\right) \tag{9} \]

\[ \sigma = 0.025 \tag{10} \]

where d is a fixation length, c is the centroid that corresponds
to a certain saliency map feature, and a(d, c) is the contribution of
a fixation with length d to the gaze feature with centroid c.
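
The soft binning of Eqs. (8) to (10) can be sketched as follows, assuming that s(d, c) is a Gaussian density with mean d and standard deviation σ = 0.025 evaluated at the centroid c. The function name and the example centroids are hypothetical. The weights are computed in the log domain: the factor 1/(σ√(2π)) cancels in the normalization of Eq. (8), and subtracting the largest exponent avoids a division by zero when a fixation is far from every centroid.

```python
import math

SIGMA = 0.025  # value given in Eq. (10)

def soft_bin_weights(d, centroids, sigma=SIGMA):
    """Contribution a(d, c) of a fixation of length d to each centroid c."""
    # Log of the Gaussian similarity, up to a constant that cancels out.
    log_s = [-((c - d) ** 2) / (2.0 * sigma ** 2) for c in centroids]
    shift = max(log_s)
    unnormalized = [math.exp(v - shift) for v in log_s]
    total = sum(unnormalized)  # at least 1.0, since exp(0) is included
    return [u / total for u in unnormalized]

# A fixation close to the first of two hypothetical centroids contributes
# mostly to the first channel; the weights always sum to one.
weights = soft_bin_weights(0.12, [0.10, 0.20])
```

Each fixation is then added to every fixation map channel, scaled by its weight for that channel.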



Clusters   DPM    Gaze   Noise  Zero
2          30.2   34.7   30.8   29.8
3          29.9   34.5   29.6   29.7
4          27.6   31.5   28.0   28.4


Table 3: Comparison of performance of different modifications of DPM with different numbers of clusters. ’Gaze’ corresponds to GazeDPM
used with fixation maps generated from real fixations, ’Zero’ corresponds to GazeDPM used with fixation maps where every value is set to
zero, ’DPM’ is the unchanged implementation of the DPM library [7], and ’Noise’ corresponds to the performance of GazeDPM with fixation
maps filled with uniform noise.


Class         DPM    100 ms  200 ms  300 ms  400 ms  600 ms  800 ms  All
                     (1fx)   (3fx)   (9fx)   (13fx)  (18fx)  (22fx)
cat           23.9   28.8    32.1    37.7    37.5    37.0    37.5    40.2
cow           22.6   20.3    23.3    21.9    20.6    23.7    24.5    24.9
dog           14.7   18.1    15.5    21.1    23.8    25.3    27.7    28.2
horse         43.9   43.5    44.7    45.3    43.7    45.3    45.6    46.0
aeroplane     41.8   45.6    44.6    43.0    41.8    42.4    43.8    40.6
bicycle       53.5   52.2    53.1    53.6    53.7    55.8    56.6    53.5
boat           8.4    7.3     8.2    10.2     8.5     9.9    10.2     9.3
diningtable   19.8   18.8    19.1    24.4    27.4    22.7    26.9    30.0
motorbike     48.5   46.9    45.2    45.1    44.8    44.0    45.6    45.9
sofa          26.7   27.3    31.5    26.6    30.4    28.5    32.3    28.5
Average       30.4   30.9    31.7    32.9    33.2    33.5    35.1    34.7


Table 4: Performance of the gaze-enabled DPM modification in terms of mAP on images of the POET dataset using different numbers of fixations
sampled until a certain viewing time. For each column, the corresponding average number of fixations for the viewing time is specified.



Class         All    100 ms  200 ms  300 ms  400 ms  600 ms  800 ms  DPM
                     (24fx)  (22fx)  (17fx)  (12fx)  (8fx)   (3fx)
cat           40.2   37.6    37.4    38.6    38.6    37.0    28.5    23.9
cow           24.9   23.6    22.3    26.0    22.2    24.5    23.2    22.6
dog           28.2   28.9    26.0    24.1    22.3    15.5    20.9    14.7
horse         46.0   45.7    45.2    44.4    45.6    43.8    42.9    43.9
aeroplane     40.6   41.7    42.4    41.9    40.2    43.4    46.7    41.8
bicycle       53.5   55.5    55.8    53.4    53.8    53.7    53.8    53.5
boat           9.3   10.4     9.5    10.7     9.7     9.7     7.5     8.4
diningtable   30.0   26.6    22.6    30.0    27.7    26.3    24.9    19.8
motorbike     45.9   46.2    42.7    44.6    43.6    47.5    46.8    48.5
sofa          28.5   31.5    30.9    29.2    28.8    24.4    27.8    26.7
Average       34.7   34.8    33.5    34.3    33.2    32.6    32.3    30.4


Table 5: Performance of the gaze-enabled DPM modification in terms of mAP on images of the POET dataset using different numbers of fixations
sampled after a certain viewing time. For each column, the corresponding average number of fixations for the viewing time is specified.
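
The two time-based sampling criteria of Tables 4 and 5 can be illustrated with the sketch below. The `Fixation` record and the function names are hypothetical, and the sketch assumes that a fixation is assigned to a time window by its onset; the text does not state whether onset or end time was used.

```python
from typing import List, NamedTuple

class Fixation(NamedTuple):
    x: float
    y: float
    start_ms: float  # onset relative to the start of image presentation

def fixations_until(fixations: List[Fixation], t_ms: float) -> List[Fixation]:
    """Fixations that started no later than t_ms (as in Table 4)."""
    return [f for f in fixations if f.start_ms <= t_ms]

def fixations_after(fixations: List[Fixation], t_ms: float) -> List[Fixation]:
    """Fixations that started after t_ms (as in Table 5)."""
    return [f for f in fixations if f.start_ms > t_ms]

# Hypothetical fixations on one image, pooled over viewers.
pooled = [Fixation(10.0, 20.0, 50.0),
          Fixation(30.0, 40.0, 150.0),
          Fixation(50.0, 60.0, 350.0)]
early = fixations_until(pooled, 200.0)
late = fixations_after(pooled, 200.0)
```

Under this assumption the two subsets partition the fixations for any threshold, which is consistent with the fixation counts in Tables 4 and 5 moving in opposite directions as the threshold grows.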


Class         DPM    1 fix-s  2 fix-s  3 fix-s  7 fix-s  11 fix-s  15 fix-s  19 fix-s  23 fix-s  All
cat           23.9   32.9     35.3     36.1     35.6     37.5      36.4      37.3      38.2      40.2
cow           22.6   19.7     21.4     18.9     22.4     24.0      21.9      22.0      23.7      24.9
dog           14.7   19.8     22.7     24.5     21.9     24.9      26.2      26.1      25.1      28.2
horse         43.9   44.5     43.6     43.9     43.4     43.8      42.3      43.9      44.1      46.0
aeroplane     41.8   42.7     42.8     39.5     42.3     40.2      39.3      40.9      40.6      40.6
bicycle       53.5   51.6     54.0     50.8     53.1     54.5      53.6      55.7      55.3      53.5
boat           8.4    6.0      7.4      6.2      9.8      9.6       9.4      10.0       9.8       9.3
diningtable   19.8   21.3     22.7     22.8     26.2     29.3      27.4      27.8      24.6      30.0
motorbike     48.5   47.2     42.2     45.4     46.4     46.6      45.2      45.5      46.2      45.9
sofa          26.7   24.8     24.9     26.1     28.2     28.3      30.9      31.1      28.7      28.5
Average       30.4   31.1     31.7     31.4     32.9     33.9      33.3      34.0      33.6      34.7


Table 6: Performance of the gaze-enabled DPM modification in terms of mAP on images of the POET dataset using different numbers of fixations
sampled randomly from all available fixations.



Class         DPM    1 fx   3 fx   7 fx   11 fx  15 fx  19 fx  23 fx  All
cat           23.9   29.4   29.8   35.8   33.7   38.4   38.3   36.4   40.2
cow           22.6   24.0   19.2   22.0   19.8   21.9   24.4   23.2   24.9
dog           14.7   18.4   18.3   18.1   20.2   26.0   24.1   25.4   28.2
horse         43.9   43.1   44.5   45.3   46.1   47.0   45.4   44.0   46.0
aeroplane     41.8   46.5   42.7   44.7   43.9   43.0   42.8   43.1   40.6
bicycle       53.5   52.5   54.7   54.7   53.2   54.2   54.7   55.7   53.5
boat           8.4    8.1    9.3    9.0   11.1    9.6   10.7   10.9    9.3
diningtable   19.8   20.3   22.8   19.7   22.9   23.2   23.4   21.9   30.0
motorbike     48.5   46.6   47.6   43.6   44.1   46.6   45.8   47.1   45.9
sofa          26.7   24.0   26.3   23.6   24.1   32.1   29.3   31.5   28.5
Average       30.4   31.3   31.5   31.7   31.9   34.2   33.9   33.9   34.7


Table 7: Performance of the gaze-enabled DPM modification in terms of mAP on images of the POET dataset using different numbers of first fixations.


Class         DPM    1 fx   3 fx   7 fx   11 fx  15 fx  19 fx  23 fx  All
cat           23.9   33.8   35.1   34.7   39.7   36.6   35.5   38.7   40.2
cow           22.6   20.4   19.2   21.9   22.0   21.8   19.8   22.4   24.9
dog           14.7   14.8   22.4   24.9   23.1   23.6   28.6   26.7   28.2
horse         43.9   43.6   43.2   43.3   44.2   44.8   46.4   45.7   46.0
aeroplane     41.8   43.9   45.3   47.0   41.9   41.2   42.4   39.5   40.6
bicycle       53.5   52.8   53.2   51.7   53.8   54.0   54.6   54.6   53.5
boat           8.4    8.1    8.1    9.5   11.4    9.1   10.5   11.8    9.3
diningtable   19.8   17.3   25.9   26.9   28.7   19.8   21.9   22.7   30.0
motorbike     48.5   47.1   45.8   45.9   47.4   44.1   44.8   43.5   45.9
sofa          26.7   25.5   26.0   28.1   27.6   30.8   27.2   23.6   28.5
Average       30.4   30.7   32.4   33.4   34.0   32.6   33.2   32.9   34.7


Table 8: Performance of the gaze-enabled DPM modification in terms of mAP on images of the POET dataset using different numbers of last fixations.
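
The count-based sampling criteria of Tables 6 to 8 (random, first and last fixations) can be sketched as below. The function names are hypothetical, and the sketch assumes that the fixations for an image are given as a list in temporal order; how fixations from several viewers were pooled and ordered is not specified here.

```python
import random

def first_n(fixations, n):
    """The n earliest fixations (as in Table 7)."""
    return fixations[:n]

def last_n(fixations, n):
    """The n latest fixations (as in Table 8)."""
    # fixations[-0:] would return the whole list, so n == 0 is handled apart.
    return fixations[-n:] if n > 0 else []

def random_n(fixations, n, rng=None):
    """n fixations drawn without replacement (as in Table 6)."""
    rng = rng or random.Random()
    return rng.sample(fixations, min(n, len(fixations)))

# Hypothetical fixation ids in temporal order.
ordered = [0, 1, 2, 3, 4, 5, 6]
subset_first = first_n(ordered, 3)
subset_last = last_n(ordered, 3)
subset_random = random_n(ordered, 3, rng=random.Random(0))
```

When an image has fewer than n fixations, each function simply returns all of them.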



Class         DPM    GBVS   BMSV2
cat           23.9   30.2   30.2
cow           22.6   21.4   23.8
dog           14.7   24.5   17.5
horse         43.9   40.0   42.6
aeroplane     41.8   40.9   40.5
bicycle       53.5   53.3   53.9
boat           8.4    9.5    7.4
diningtable   19.8   20.4   22.3
motorbike     48.5   44.4   48.9
sofa          26.7   27.7   25.7
Average       30.4   31.2   31.3


Table 9: Performance of GazeDPM on images of the POET dataset with generated fixation maps using BMS V2, using maps from BMS
corresponding to salient object detection, maps generated from POET fixations, and the unmodified DPM library. For all modifications, 2 aspect
ratio clusters are used. ’DPM’ is the unchanged implementation of the DPM library.



